DNA Research 20, 339-3 53, (201 3) 
Advance Access publication on 1 8 April 2013 



doi:1 0.1 093/dnares/dst014 



Next-Generation Annotation of Prokaryotic Genomes with EuGene-P: 
Application to Sinorhizobium meliloti 201 1 

Erika SaIIet 1>2 't, Brice Roux 1 < 2 't, Laurent Sauviac 1 - 2 , Marie-Francoise Jardinaud 1 - 2 - 3 , Sebastien Carrere 1,2 , 
Thomas Faraut 4 - 5 , Fernanda de Carvalho-Niebel 1 - 2 , Jerome Gouzy 1 - 2 , Pascal Gamas 1 ' 2 , Delphine Capela 1 ' 2 , 
Claude Bruand 1 - 2 , and Thomas Schiex 6 -* 

INRA, Laboratoire des Interactions Plantes-Microorganismes (LIPM), UMR 44 1 , Casta net-Tolosan F-31 326, France 1 ; 
CNRS, Laboratoire des Interactions Plantes-Microorganismes (LIPM), UMR 2594, Castanet-Tolosan F-31 326, 
France 2 ; INPT-Universite de TOULOUSE, ENSAT- Avenue de I'Agrobiopole, Auzevilie-Tolosane-3 1 326 - Castanet- 
Tolosan Cedex 3 ; INRA, Laboratoire de Genetique Cellulaire, UMR 444, Castanet-Tolosan F-31 326, France 4 ; ENVT, 
Laboratoire de Genetique Cellulaire, UMR 444, Castanet-Tolosan F-31 326, France 5 and INRA, Unite de Biometrie et 
d'lntelligence Artificielle, UR 875, Castanet-Tolosan F-31 326, France 6 

*To whom correspondence should be addressed. Tel. +33-561 285428. Fax. +33-561 285335. 
Email: thomas.schiex@toulouse.inra.fr 

Edited by Dr Kenta Nakai 

(Received 1 2 February 201 3; accepted 26 March 201 3) 
Abstract 

The availability of next-generation sequences of transcripts from prokaryotic organisms offers the op- 
portunity to design a new generation of automated genome annotation tools not yet available for prokar- 
yotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool which 
combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the prediction 
process. This enables the automated prediction of coding sequences (CDSs), untranslated regions, tran- 
scription start sites (TSSs) and non-coding RNA (ncRNA, sense and antisense) genes. EuGene-P was 
used to comprehensively and accurately annotate the genome of the nitrogen-fixing bacterium 
Sinorhizobium meliloti strain 2011, leading to the prediction of 6308 CDSs as well as 1876 ncRNAs. 
Among them, 1280 appeared as antisense to a CDS, which supports recent findings that antisense tran- 
scription activity is widespread in bacteria. Moreover, 4077 TSSs upstream of protein-coding or non- 
coding genes were precisely mapped providing valuable data for the study of promoter regions. By 
looking for RpoE2-binding sites upstream of annotated TSSs, we were able to extend the S. meliloti 
RpoE2 regulon by ~ 3 -fold. Altogether, these observations demonstrate the power of EuGene-P 
to produce a reliable and high-resolution automatic annotation of prokaryotic genomes. 
Keywords: genome annotation; prokaryotes; RNA-Seq; rhizobium 



1 . Introduction 

With the new generation of sequencing (NGS) tech- 
nologies, bacterial and archeal genome projects now 
combine deep genomic sequencing with a variety of 
transcriptome libraries. 1-4 If the main motivation for 
transcriptome sequencing is usually the quantification 
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of gene expression, the transcribed sequences gener- 
ated by deep sequencing can also contribute to pro- 
karyotic genome annotation by the elucidation of 
gene structural features, including transcription start 
sites (TSSs), 5' and 3' untranslated regions (UTRs) and 
the identification of non-coding RNA (ncRNA) genes. 
The quantification of gene expression following deep 
cDNA sequencing is based on the number of reads 
that map to a given gene. Therefore, the development 
of genome annotation tools that enable a better 



©The Author 2013. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/ 
licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. 
For commercial re-use, please contactjournals.permissions@oup.com. 



340 



S. meliloti Genome Annotation with EuGene-P 



[Vol. 20, 



delineation of transcriptsshould lead to a more reliable 
expression measurement. In the recent sequencing of 
bacterial and archeal genomes, the annotation has 
still been done manually owing to the lack of appropri- 
ate tools to integrate RNA-Seq data. 5 Indeed, most 
existing prokaryotic gene finders 6-9 or high-level bac- 
terial annotation systems 10,11 are based on genomic 
sequence analysis and cannot take into account avail- 
able expression data in the structural prediction. 
Expert annotation using RNA-Seq data has been recent- 
ly facilitated by the use of integrated tools, such as 
VESPA 12 or Microscope, 1 3 which allow to simultan- 
eously visualize genomic, transcriptomic, proteomic 
or syntenic data, but the ultimate curation process 
still remains laborious. 

With the tremendously increasing number of pro- 
karyotic genomes that is being sequenced, there is a 
clear need for automated prokaryotic genome anno- 
tation tools able to integrate the variety of inform- 
ative data that can be produced either by second- 
generation sequencing or by other high-throughput 
analyses, such as tiling arrays and proteomics. The de- 
velopment of such prokaryotic gene finders allowing 
not only the prediction of coding sequences (CDSs), 
but also TSSs and non-coding (nc) transcribed genes, 
should provide improved transcript quantification, 
facilitated identification of regulatory sequences up- 
stream of mapped TSSs and thus, easier analysis of 
gene regulation. Because of the higher complexity of 
eukaryotic gene structures and the usual availability 
of transcribed sequences (such as expressed sequence 
tags or ESTs), many eukaryotic gene finders already 
have the ability to integrate experimental evidence in 
their gene prediction process. For example, ESTs are 
exploited in EuGene 14 and Augustus, 15 GenomScan 16 
uses similarities with known proteins, whereas SGP/ 
SGP2 1718 and EuGene'Hom 19 integrates sequence con- 
servation with related organisms. 

In this work, we adapted the eukaryotic gene finder, 
EuGene 14,20 , to the specific requirements of gene 
identification in prokaryotes, where in particular over- 
lapping CDSs are relatively frequent. EuGene has 
already been used successfully to annotate a variety 
of eukaryotic genomes 21-27 and has shown its 
ability to quickly incorporate new types of informa- 
tion for enhancing its predictive power. The generic 
tool developed here, called EuGene-P, exploits high- 
throughput data, such as strand-specific RNA-Seq 
data, to qualitatively improve the prediction contents 
and to minimize manual expert annotation. The pro- 
duced annotation contains previously unpredicted 
important gene structure features such as 5' and 3' 
UTRs, as well as ncRNA genes (including antisense 
RNAs). The mathematical model behind EuGene-P 
and its modular software architecture based on 
plug-ins facilitate the integration of a variety of 



other high-throughput data, such as PET-Seq, mass 
spectrometry data, protein similarities, DNA homolo- 
gies, predicted transcription terminators and others. 
The source codes of EuGene-P are available under 
the open-source Artistic licence at https://mulcyber. 
toulouse.inra.fr/projects/eugene. A fully automated 
generic prokaryotic pipeline annotation relying on 
EuGene-P is under preparation and will be made 
available. 

We trained and used EuGene-Pforthe annotation of 
the nitrogen-fixing symbiont Sinorhizobium meliloti 
bacterial strain 201 1 (Sm201 1). Sinorhizobium meliloti 
is a Gram-negative bacterium belonging to the alpha 
subclass of Proteobacteria, which can live either free 
in the soil, or in symbiotic association with roots of 
legume plants such as the model legume Medicago 
truncatula. 26 The Sinorhizobium-Medicago symbiotic 
interaction leads to the formation of new root organs 
called nodules, within which bacteria differentiate 
into bacteroids that fix nitrogen to the benefit of the 
host plant. Both nodule organogenesis and bacteroid 
differentiation are complex developmental processes 
that involve deep reprogramming of gene expression 
in both organisms. 28-30 The 6.7-Mb genome of 
Sm2011 is composed of three replicons, one main 
chromosome and two megaplasmids called pSymA 
and pSymB. The Sm201 1 strain used in this study is 
closely related to the Sm1021 reference strain that 
was previouslysequenced. 31 Both strains are independ- 
ent spontaneous streptomycin-resistant derivatives of 
the parental SU47 strain. 32 Despite being originated 
from the same parental strain, a numberof phenotypic 
differences were reported, 33-38 which may be related 
to specific genetic differences. In this work, we deter- 
mined both the genome sequence and the transcrip- 
tome of the Sm201 1 strain under in planta and 
different growth conditions. These data were inte- 
grated into EuGene-P to refine and enrich the annota- 
tion of the S. meliloti genome sequence, notably to 
predict TSSs and ncRNA genes. 

2. Materials and methods 

2. 7. Bacterial strains and growth conditions 

The bacterial strain used inthisstudy wasthe strepto- 
mycin-resistant derivative of Sm201 1 (GMI1 1 49 5). A 
rpoE2 mutant derivative of this strain was generated 
as previously described. 39 Strains were grown under 
aerobic conditions at 28°C in Vincent minimal 
medium supplemented with disodium succinate and 
ammonium chloride as carbon and nitrogen sources 
as previously described. 40 Bacteria were collected 
either in a mid-exponential phase (OD 60 o = 0.6) or in 
an early stationary phase (~1 h 30 min after entry in 
a stationary phase, OD 60 o=1-2). Bacteria were 



No. 4] 



E. Sallet et al. 



341 



harvested by filtration on 0.2 |xm membranes, frozen 
in liquid nitrogen and stored at -80°C until RNA ex- 
traction. Bacterial cultures were collected from three 
independent biological experiments. 

2.2. Plant material and growth conditions 
Medicago truncatula cv Jemalong A1 7 seeds were 

germinated and transferred to aeroponic caissons as 
described, 41 under the following chamber conditions: 
temperature: 22°C; 75% hygrometry; light intensity: 
200 |jlE m~ 2 s -1 ; light-dark photoperiod: 16-8h. 
Plants were grown for 18 days in caisson growth 
medium 42 supplemented with 1 0 mM NH 4 N0 3 , 
before growth in nitrogen-free medium for 4 days 
prior to inoculation with S. meliloti. At 1 0 days post- 
inoculation, nodules were harvested on ice from at 
least 20 plants, immediately frozen in liquid nitrogen 
and stored at -80°C. Each biological repetition corre- 
sponded to an independent caisson, with ~40 plants 
per caisson. 

2.3. Sinorhizobium meliloti genome sequencing 
The genome of Sm201 1 was sequenced at the 

Genoscope (CNS, Evry, France) using fractions of 454 
Titanium (46 Mb), 454 paired ends (18 Mb, insert 
size: 8 kb) and lllumina single end reads (1 .2 Gb, read 
length: 76 nt), providing a 1 90-fold theoretical cover- 
age of the genome. The genome sequence was 
assembled as described in Supplementary Materials 
and Methods. The nucleotide sequences of Sm201 1 
and Sm1021 strains were compared using the glint 
software (Faraut T. and Courcelle E.; http://lipm- 
bioinfo.toulouse.inra.fr/download/glint/, unpublished) 
to identify polymorphic regions. A set of 7 1 mutations 
including 64 putative frameshifts were verified by 
Sanger sequencing of polymerase chain reaction 
(PCR) products surrounding these regions generated 
usingeitherSm201 1 orSml 021 32 DNA as a template. 
The genome sequence of Sm201 1 was submitted to 
Genbank under accession numbers CP004138, 
CP0041 39 and CP0041 40, and a browser was set up 
at https://iant.toulouse.inra.fr/S.meliloti201 1 . 

2.4. RNA preparations 

RNAs were prepared as described in Supplementary 
Materials and Methods. Briefly, total RNAs extracted 
from cultured bacteria and root nodules were 
depleted of ribosomal RNAs by an oligocapture strat- 
egy derived from the Plant Ribominus kit (Invitrogen), 
in which the oligonucleotide sets were specifically 
designed to target M. truncatula and S. meliloti 
rRNAs, as well as the highly abundant S. meliloti 
tRNA-Ala (see Supplementary Table S1 for oligo- 
nucleotide sequences). RNAs were then separated in 
two fractions, short (<200 nt) and long (>200 nt), 



using Zymo Research RNA Clean & Concentrator™-5 
columns (Proteigene). 

2.5. cDNA library preparation and lllumina 
sequencing 

Oriented sequencing with a RNA ligation procedure 
was carried out by Fasteris SA (Geneva, Switzerland) 
using procedures recommended by lllumina, with 
adaptors and amplification primers designed by 
Fasteris, unless specified. For small RNAs, the Small 
RNA Sequencing Alternative v1.5 Protocol (lllumina) 
was used, starting with ~500 ng RNAs that were 
treated with tobacco acid pyrophosphatase to 
remove triphosphate at 5' transcript ends and purified 
on acrylamide gel before and after the adaptor liga- 
tion step. The 3' adaptor was the Universal miRNA 
cloning linker (NEB). For large RNAs, the amount of 
starting RNAs was ~2 00 ng, and a fragmentation 
step by zinc during 8 min was included, after the 
lllumina procedure. The size of selected inserts was 
20-1 20 nt for short RNA libraries and 50-1 20 nt 
for long RNA libraries from cultured bacteria and 
1 50-250 nt for long RNA libraries from nodules. 
Libraries were sequenced either in paired end or in 
single end (Table 1 ). Raw sequence data were submit- 
ted to the Gene Expression Omnibus (GEO) database 
(Accession GSE44083). 

2.6. Read mapping 

Reads were mapped to the genome using the pro- 
cedure as described in Supplementary Materials and 
Methods. For paired-end reads, all positions between 
the two reads were considered as transcribed. All tran- 
scription data can be visualized in the genome 
browser (https://lipm-browsers.toulouse.inra.fr/gb2/ 
gbrowse/GMI1 1495-Rm201 1G). 

2.7. Semi-conditional random field and 
associated features 

The mathematical model of semi-conditional 
random field (CRF) 43 has been used for gene finding 
in the eukaryotic gene finders, such as CRAIG 44 and 
CONRAD, 45 and implicitly used in EuGene from its 
creation. The semi-CRF model in EuGene-P is used 
to define an optimal segmentation of each strand of 
the genomic sequence into a succession of biological- 
ly meaningful regions. For one strand, the segmenta- 
tion is defined by a succession of regions s = (si . . . 
s q ). Each region s, = {b iy /,-, t,) starts at position b iy has 
length /, and labels t,. A label can be any of {/G, 
UTR5', UIR, UTR3', ncRNA, CDS U CDS 2 , CDS 3 , CDS V2 , 
CDS 2:3 , CDS 1:3 }, where IC stands for intergenic, 
UTR5', UIR and UTR3' for untranslated regions of 
coding genes, ncRNA for non-coding RNA genes, CDS, 
for coding regions in frame / and CDS, :; - for overlapping 
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GEO sample code 


RNA samples 


RNA 
fraction 


Biological 
replicate 
number 


Sequencing 
process 


Number of unambiguously 
mapped reads or 
paired-reads 


GSM1 07? 


i 1 08 


Nodule 


Long 


1 


pe 


2 


x 54 nt 


79 339 


GSM1 07f 


51 09 


Nodule 


Long 


2 


pe 


2 


x 54 nt 


1 03 025 


GSM1 07i 


1110 


Nodule 


Long 


3 


pe 


2 


x 54 nt 


55 825 


GSM1 07? 


S1 1 1 


Nodule 


Short 


1 


pe 


2 


x 54 nt 


785 009 


GSM1 07 c 


1112 


Nodule 


Short 


2 


pe 


2 


x 54 nt 


1 503 684 


CSMl 071 


1113 


Nodule 


Short 


3 


pe 


2 


x 54 nt 


1 465 61 0 


GSM1 07? 


3114- 


Bacteria mid-exponential phase 


Long 


1 


se 


1 


x 50 nt 


4 1 58 264 


GSM1 07 1 


1115 


Bacteria mid-exponential phase 


Long 


2 


se 


1 


x 50 nt 


4 1 54 232 


GSM1 07 1 


1116 


Bacteria mid-exponential phase 


Long 


3 


se 


1 


x 50 nt 


2 873 524 


GSM1 07f 


5117 


Bacteria mid-exponential phase 


Short 


1 


pe 


2 


x 50 nt 


4 792 283 


GSM1 07i 


!1 1 8 


Bacteria mid-exponential phase 


Short 


2 


pe 


2 


x 50 nt 


5 390 729 


GSM1 07? 


!1 1 9 


Bacteria mid-exponential phase 


Short 


3 


pe 


2 


x 50 nt 


9 061 874 


GSM1 07i 


!1 20 


Bacteria stationary phase 


Long 


1 


se 


1 


x 50 nt 


2 1 02 607 


GSM1 07f 


5121 


Bacteria stationary phase 


Long 


2 


se 


1 


x 50 nt 


3 171 844 


GSM1 07? 


3122 


Bacteria stationary phase 


Long 


3 


se 


1 


x 50 nt 


2 953 260 


GSM1 07? 


3123 


Bacteria stationary phase 


Short 


1 


pe 


2 


x 50 nt 


1 1 368 031 


GSM1 07? 


3124 


Bacteria stationary phase 


Short 


2 


pe 


2 


x 50 nt 


5 960 882 


GSM1 07? 


3125 


Bacteria stationary phase 


Short 


3 


pe 


2 


x 50 nt 


5 559 756 



All RNA samples were depleted in ribosomal RNA using the RiboMinus™ protocol and separated in short (<200 nt) and 
long (>200 nt) fractions. Note that nodule libraries contain a mixture of S. meliloti and M. truncatula transcriptomes. 
Figures indicated here correspond toS. meliloti sequence reads only, 
pe, paired ends; se, single end. 



coding regions in frame ; and j. See Fig. 1 for an 
example. 

The linear semi-CRF model computes the score of a 
segmentation (s 1( . . .,s q ) of a given input sequence as 
a linear combination of functions representing indi- 
vidual features of the segmentation. Each feature 
scores a region s, based on its length /,-, its label t it 
the label of the previous segment t,-_i and some evi- 
dence x (including the DNA sequence). EuGene-P 
relies more specifically on three types of features: 

(i) Contents features, cffc(s,-), score the fact that a 
region s,- has received label f,-. For example, if 
the nucleotides in the region s,- appear in an 
alignment with a known protein, a 'protein align- 
ment' feature will score positively if the asso- 
ciated label tj represents a coding region in the 
frame/strand indicated by the alignment. 

(ii) Signal features, sf^(t ( -_i score the fact that a 
region s, with label t,- starts at position b- } after a 
region with label t-,-\. For example, a 'RNA-Seq 
sharp depth upshift' feature will score positively 
if s/_ i , labelled as an intergenic region, is followed 
by Si defining a transcribed region, and a sharp 
upshift in the transcription level is observed on 
mapped RNA-Seq around position b,. 



CDSi 



CDS3 



CDS2 



UTR5' 



CDS 1: ; 



UIR 



S2 IG 



UTR3' ncRNA 



Figure 1. A prokaryotic genomic sequence and the corresponding 
annotation defined as a sequence of typed regions. Each 
region has a specific label (or state) that defines its type. 
Beyond coding regions (e.g. CDS,) and intergenic regions (IG), 
an annotation may identify untranslated transcribed regions at 
the extremities of coding transcripts (5' and 3' UTRs), 
untranslated internal regions (or UIR, between two CDSs in a 
transcript) and ncRNA genes. Specific region types are also 
used to label overlapping CDSs. In this figure, the region 
labelled CDS, :3 corresponds to the overlap of a CDS in frame 1 
with another CDS in frame 3. 

(iii) Length features, lffc(s,), score the fact that a 
segment s,- has a given length. A typical example 
would be a feature scoring against extremely 
short CDSs. 

Each feature can be understood as generating votes in 
favour of some annotations. After a learning phase, 
each feature receives a weight representing a 'confi- 
dence'. The annotation that collects the maximum 
weighted sum of votes is considered as the optimal 
prediction. The usual probabilistic interpretation of 
CRFs, the formal definition of all features used inside 
EuGene-P and associated training and prediction 
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algorithms are described in Supplementary Materials 
and Methods. 

2.8. Transcriptome analysis 

Differential expression of identified genes was cal- 
culated with R v2.1 3.0 using DESeq v1 .4.1 46 available 
in Bioconductor v2.8. DESeq utilizes a negative bino- 
mial distribution for modelling read counts per tran- 
script and implements a method for normalizing the 
counts. Variance was estimated using the per-condi- 
tion argument. P-values are adjusted for multiple 
testing using the Benjamini and Hochberg method. 47 

2.9. Quantitative RT-PCR analyses 

Reverse transcription was performed using 
Superscript II reverse transcriptase (Invitrogen) with 
random hexamers as primers. RNA samples isolated 
from at least three independent experiments were 
tested for each condition. Real-time PCRs were run 
on a LightCycler system (Roche) using the FastStart 
DNA MasterPLUS SYBRGreen I kit (Roche) according 
to the manufacturer's instructions. 

For gene expression normalization, six reference 
genes were selected from the RNA-Seq data of the 
current study, on the basis of their similar levels of 
expression in both culture conditions (exponential 
and stationary growth phases) and M. truncatula 
nodules. The expression level of these genes was 
then examined by qRT-PCR in wild-type and rpoEl 
mutant strains grown at 2 8 and 40°C, and expression 
data were computed using the NormFinder applica- 
tion. 48 SMc00519 and SMb21134 were found as 
the more stably expressed genes by NormFinder and 
were therefore used as references for qRT-PCR nor- 
malization in our conditions. Oligonucleotides 
sequences used for PCR are listed in Supplementary 
Table S2. 

3. Results 

3.1 . A new integrative annotation tool for prokaryotic 
genomes 

One of the main results of this work is the definition 
of an integrative gene finder for prokaryotic gene pre- 
diction, allowing automatic incorporation of various 
sources of evidence in the prediction process, includ- 
ing oriented RNA-Seq data. The produced annotation 
not only accounts for statistical properties of observed 
open reading frames, but also for consistency with a 
variety of experimental data, thus minimizing subse- 
quent manual expert annotation work. 

We designed EuGene-P on the basis of the eukaryot- 
ic gene finder, EuGene. 14,20 EuGene is able to incorp- 
orate the various types of information for enhancing 
its predictive power and has been used for the 



annotation of several genomes. 21 ~ 27 As all recent in- 
tegrative gene finders, EuGene does not rely on a full 
generative probabilistic model, such as Hidden 
Markov Models, 49 that would require the expensive 
and unrealistic probabilistic modelling of all depend- 
encies between the available information, but on a 
dedicated discriminative model. Formally, EuGene-P 
as EuGene can be described as semi-linear CRF-, or 
SL-CRF-, 43 based predictor. A CRF is a variant of 
Markov random fields, aimed at capturing the condi- 
tional probability of a succession of unknown discrete 
random variables y = (yi ...j/n) given observed vari- 
ables x (the available evidence). From such a model, 
the values of the unknown variables y can be recon- 
structed as the most probable ones given the available 
evidence x. In gene finding, the genomic sequence 
and the available information (mapped reads, other 
similarities . . .) will be represented as the evidence x. 
The unknown (or hidden) variables y are used to rep- 
resent structural annotations. We therefore associate 
one variable y, with every base in the sequence. The 
variable y- t specifies the annotation label (or state) of 
the base at position / (inside a CDS, an intergenic 
region . . .). In eukaryotic genomes, despite the accu- 
mulating evidence of overlapping functional regions, 
existing gene finders usually assume that each base 
belongs to just one type of region. The above model, 
with one variable j/,- per base, is perfectly suitable to 
perform the gene prediction on both strands simul- 
taneously. In gene-dense prokaryotic genomes, over- 
lapping functional regions is a rather frequent event. 
Genes can overlap with neighbouring genes on 
either strand. The genomic model we chose is there- 
fore an unusual stranded model. This model describes 
how genes appear on one strand, independently of 
the other. 

Formally, we have to enumerate the list of possible 
states for a nucleotide in an annotation. As shown in 
Fig. 1 , since we restrict ourselves to a single strand, a 
typical prokaryotic sequence will contain bases 
belonging to either an intergenic region (denoted as 
/G), a transcribed non-translated region of a coding 
gene (denoted as UTR5', UIR or UTR3' depending on 
its location in the gene), a ncRNA gene (denoted as 
ncRNA), a non-overlapping CDS region in a given 
coding frame / (denoted as CDS,) or a region where 
two CDS in different coding frames / and / overlap 
(denoted as CDS, 7 ). 

Overall, each variable y h representing possible 
annotations for nucleotide /, may take 1 1 different 
states. Such states cannot appear arbitrarily in the 
genome sequence. For example, a CDS must start 
and end at specific codons. The CRF model can 
capture gene structures described as simple automa- 
ton. The automaton used in EuGene-P is described 
in Fig. 2. Transitions between possible states in the 
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automaton correspond to the occurrence of specific 
biological signals in the sequence. Transcription 
Starts and Transcription Ends denote the start and 
end of transcripts (containing coding genes or 
ncRNA genes), whereas Translation Starts and 
Translation Ends (denoted as TS,- and TE,-, respectively, 
where / is the frame of the corresponding codon in 
the sequence) enable to, respectively, start or end a 
CDS inside a transcript and possibly inside another 
CDS in a different frame. Finally, the conditional prob- 
ability distribution that relates the evidence in x and 
possible annotations in y must be described. In CRF, 
this is done through a set of features. Every type of ex- 
perimental or statistical evidence is represented by 
one (or more) feature. A feature is a small mathemat- 
ical function that uses some available evidence to vote 
in favour of (or against) the prediction of specific ele- 
ments. For example, a 'protein similarities' feature 
would vote in favour of CDS prediction in the 
regions that have similarities with known proteins. A 
precise definition of the different features available 
in EuGene-P is given in Materials and Methods. Once 
the set of features used for gene finding is fixed, the 
CRF model can be trained. This training process com- 
putes a multiplicative factor for each feature that 
determines a feature-specific confidence. The predic- 
tion is then in charge of finding the annotation that 
has maximum conditional probability. This is the 
prediction that accumulates most support from all 
features. Overall, the mathematical model and 




Figure 2. The different states and possible transitions between 
these states used inside EuGene-P. 



associated software provide a qualitative improve- 
ment in terms of its abilities in predicting TSSs, un- 
translated transcribed regions, overlapping CDSs, 
ncRNA genes and antisense genes. 

3.2. Generation of high-quality Sinorhizobium 
meliloti 201 1 genome and transcriptome 
sequencing data 
The genome sequence of the streptomycin-resist- 
ant derivative of S. meliloti strain 2011 was generated 
using a combination of 454 (Roche) and Solexa 
(lllumina) technologies that provided a total coverage 
of ~190 genome equivalents. The assembly of the 
complete genome sequence was guided by the S. meli- 
loti 1021 sequence that was determined previously. 31 
The comparison of these two DNA sequences revealed 
463 polymorphisms, including 332 SNPs, 1 19 Indels 
and 12 large deletions or insertions (>10bp; 
Supplementary Table S3). In addition to these differ- 
ences, a 3564-nt region was specifically present in 
the chromosome of Sm201 1 but not in Sm1021. 
This insertion, located between SMc03253 and 
SMc032 54, was checked and confirmed by PCR amp- 
lification. This region contains a new gene, referred 
to as SMc06990, encoding a glutamine synthetase 
domain fused to a putative carbamoyl-phosphate syn- 
thase large chain ATP-binding protein, an enzyme that 
catalyzes the production of carbamoyl phosphate, 
which can be subsequently employed in both pyrimi- 
dine and arginine biosyntheses, 50 as well as the 
SMc06992 gene which is a duplication (1 00% identi- 
cal) of the SMc032 53 gene preceded by two copies 
of its promoter region. The promoter region of 
SMc032 53 was previously shown to be a duplication 
of the whole promoter region of fixK, a gene whose ex- 
pression is controlled by the key symbiotic transcrip- 
tion regulator FixJ. 51 Sanger DNA sequencing of 
71 polymorphic regions including 64 putative frame- 
shifts showed that 55 of them were actually errors on 
the reference sequence Sm1 021, whereas eight were 
errors on the Sm201 1 sequence and only eight were 
real polymorphisms (Supplementary Table S4). These 
results suggest that presumably only ~10% of the 
463 polymorphisms are real (most being errors in 
the Sm 1021 sequence). 

To obtain a global view of the transcriptome of 
Sm2011, RNAs were prepared from bacteria grown 
in three very different physiological conditions to 
cover a large number of expressed genes. These 
include RNAs extracted from bacteria grown in 
liquid cultures (in both exponential and stationary 
growth phases) and from 1 0-day-old nodules in 
which bacteria were differentiated in nitrogen-fixing 
bacteroids. 52 For each condition, three biological repli- 
cates were performed to assess data reproducibility 
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and reliability, and short (<200 nt) and long 
(>200 nt) RNA fractions were separately analysed. 
RNA samples were depleted in both ribosomal RNAs 
and the highly abundant tRNA-Ala using a S. meii- 
/ot/-specific capture set of oligonucleotides and were 
sequenced using the stranded lllumina protocol. 53,54 
This protocol, based on ligation of adapters directly 
to the 3' and 5' ends of the RNA molecules, has the 
advantage of preserving the information about the 
transcript orientation. The RNA-Seq libraries gener- 
ated in this study are listed in Table 1 . The resulting 
sequences were mapped onto the S. meliloti genome 
sequence. RNA-Seq data appeared to be highly repro- 
ducible as shown in Supplementary Fig. S1 (Pearson 
correlation values varied from 0.899 to 0.99 8 
between biological replicates). Of the 6308 S. meliloti 
annotated CDSs (see below), the expression of 571 7 
(90%) was detected in at least one experimental con- 
dition [raw expression level summed in the six librar- 
ies (short and long) of one condition was above 50 
reads]. The number of mapped reads per nucleotide 
(summed values from triplicates) was visualized 
using the Apollo interface. 55 Figure 3 illustrates a 3- 
kb region of the genome showing short and long 
RNAs in two conditions. The expression profiles of 
bacteria grown in exponential and stationary phases 
were compared with two previous studies performed 
in similar conditions, but based on oligonucleotide 
microarrays. 30,39 Among the 804 genes found to be 
up-regulated in stationary phase in any of these 
studies, 631 genes (78%) were consistently found in 
our study to be up-regulated in the stationary phase 
(>2-fold, P< 0.05) either in the short or long RNA li- 
braries. This percentage is similar to the percentage of 



common up-regulated genes found in the two micro- 
array studies (80%), which attest to the good quality 
of our RNA-Seq data. 

3.3. Annotation of the Sinorhizobium meliloti 20 7 1 
genome using EuGene-P 
EuGene-P inherits from EuGene its ability to inte- 
grate a variety of data. Selecting the most significant 
or informative sources of evidence is highly beneficial 
for the quality of the final annotation. We decided to 
use: 

(i) Similarities with known protein sequences mod- 
elled as a dedicated feature that votes for the 
prediction of coding regions in the correspond- 
ing coding frame (see Supplementary Materials 
and Methods). To identify similarities, we used 
the SwissProt database as a reliable general 
source of information for protein similarities. In 
addition, we used the proteome of the Sm1 021 
(set of all the protein sequences obtained by 
translating all CDS of the Sm1021 annotation) 
as a more specific source of information. 

(ii) Mapped RNA-Seq data that indicate transcription 
activity. For transcribed sequences, we used RNA 
libraries of Sm201 1 in exponential or stationary 
growth conditions and libraries of S. meliloti- 
colonized M. truncatula nodule tissues. All reads 
were mapped to the S. meliloti genome 
(Table 1). The absolute expression level and the 
changes in relative expression levels were each 
exploited in a specific feature. The absolute 
expression level was used as an evidence of 
transcribed regions, while abrupt changes in 
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Figure 3. Graphical representation of a genomic region in Apollo. Apollo represents the annotation on both strands (upper and lower part 
of the figure) as well as the expression level of the mapped RNA-Seq data from short and long RNA libraries in exponential and stationary 
growth phase conditions. Reads mapped on the plus strand are shown in colour, and reads mapped on the minus strand are in grey. Y- 
axis represents the number of reads summed from triplicates. The upper limit was set at 300 reads. This region contains several 
annotated non-coding (in green) and protein-coding (in blue) genes, full blue squares correspond to CDSs and open blue squares 
correspond to 5' and 3' UTRs. 
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expression, captured by the derivative of the log- 
level of the expression, indicate a possible TSS. 

(iii) Interpolated Markov models derived from coding 
potential to help identifying coding genes. The 
3-periodic Markov models were estimated on 
CDSs from a subset of the genes in the 
Sm1 021 annotation. Those genes have a specific 
(non-automatic) gene name, indicating that 
they have gone through expert annotation. 
Because they are known to have different statis- 
tical compositions, one coding model was esti- 
mated on pSymA genes and another coding 
model estimated on genes from pSymB together 
with the chromosome. 

(iv) Output of ncRNA prediction programs to help 
identifying RNA genes from known families. 
The genomic sequence of S. meliloti was analysed 
using tRNAscan-SE v1 .23 (April 2002) for trans- 
fer RNA detection, RNAmmer (February 2006) 
for ribosomal RNAs and rfam_scan v1.0.2 with 
Rfam v10.0 (1446 families, April 2010) for 
other known ncRNA gene families. This pro- 
duced a set of genomic regions predicted as 
ncRNA genes. Each of these intervals was used 
in a feature favouring ncRNA prediction in the 
region that contains them. 

The translation Start and Stop features are generic 
(see Materials and Methods). EuGene-P allows the 
user to parameterize the definition of Stop and Start 
codons to deal with unusual codon tables. 

Overall, the purely automated annotation of 
Sm2011 produced a total of 6483 coding genes 
and 2040 ncRNA (including tRNAs and rRNAs) 
genes. This raw annotation was then submitted to 
manual checking, leading to possible curation of pre- 
dicted CDSs, UTRs and ncRNAs. Manual modifications 
were done using Apollo 55 to simultaneously visualize 
predicted elements and RNA-Seq expression levels in 
each condition (Fig. 3). Each elementary modification 
typically impacts several levels. For instance, correc- 
tions of 5'/3' ends of UTRs often corresponded to 
the removal or creation of a new ncRNA. Typically, a 
predicted ncRNA that appeared close to 5' was 
removed, and the UTR enlarged to include the corre- 
sponding region. Overall, 1 00 ncRNAs were removed 
in this way and the corresponding region included in 
a UTR (47 5' UTRs and 53 3' UTRs), while 87 UTRs 
(35 5' UTRs and 52 3' UTRs) were modified and 
new ncRNAs annotated. Around 1 3% of protein- 



coding genes and 2 9% of nc genes were modified as 
described in Table 2. However, it is important to 
note that nc genes and UTRs are difficult to discrimin- 
ate even by expert analyses of RNA-Seq data. The 
manual curation led to the final annotation described 
in Table 3 and is available on the browser https://iant. 
toulouse.inra.fr/S.meliloti201 1 . In total 6308 
protein-coding genes, 9 rRNAs, 55 tRNAs, 2 8 tRNAs 
precursors and 1 876 ncRNAs were annotated. 

3.4. Identification of a high number of putative 
non-coding RNAs in Sinorhizobium meliloti 

The number of predicted ncRNAs was remarkable 
(Table 3). All of them, but five that were only detected 
by Rfam_scan, were supported by RNA-Seq expression 
data. Because the number of predicted ncRNAs was 
surprisingly high, we compared the automated raw 
predictions (before manual curation) with the set of 
1 1 02 small RNA candidates proposed in the previous 
RNA-Seq study of Schluter et al. 56 In that study, sRNAs 
candidates were arbitrarily classified as trans- 
encoded, cis-encoded sense, cis-encoded antisense 
and mRNA leaders. Cis-encoded sense regions have 
been reported as probable mRNA degradation pro- 
ducts in Schluter et al. 56 We therefore excluded cis- 
encoded sense candidates from the comparison. We 
found that 77% of cis-encoded antisense candidates, 
76% of trans-encoded candidates and 53% of 
mRNA-leader candidates were covered on >50% of 
their length by regions that were predicted as non- 
translated transcribed regions (UTRs or ncRNA 
regions together covering 503 kb or 3.8% of all 
chromosomal and plasmid strands). 

Regarding the 1876 ncRNAs that were predicted 
after the manual curation, a large part (68%) was 
found located antisense to a protein-coding gene. 
Antisense RNAs overlap either with the 5'end (1 0%), 
the 3'end (19%) or the central part (71%) of the 
gene found on the opposite strand. These results 
strongly support the current findings that antisense 
transcription activity is more widespread in bacteria 
than initially thought. 57,58 

Our predicted ncRNAs displayed an average size of 
107nt, 94% ranging between 20 and 250 nt 
(Supplementary Fig. S2). This length distribution is 
consistent with the sizes of 50-348 nt observed by 
Schluter et al. 56 Besides the 55 tRNAs, the nine 
rRNAs and the five well-characterized ncRNAs (ffs, 
ssrS, ssrA, rnpB and incA), only 36 additional ncRNA 



Table 2. Modifications performed on the automatic annotation during the manual curation process 



Type 


5' ends 


3' ends 


CDS starts 


CDS stops 


Creations 


Removals 


Total number of modified genes 


Coding genes 


350 


275 


135 


2 


19 


1 94 


835 


Non-coding genes 


31 


151 






1 80 


252 


604 
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Table 3. Structural annotation of the S. meliloti 201 1 genome 



CDSs (total number) 6308 

New (when compared with Sm1 021) 125 

tRNAs 55 

tRNA primary transcripts 28 

rRNAs 9 

ncRNAs 1876 

Antisense to a protein-coding gene 1 281 

TSSs (total number) 4840 

Predicted with high confidence 4077 

Predicted with low confidence 763 

Insertion sequences 94 

Repeated elements 618 

RIME 209 

MOTIF 2 56 

Sm-1 repeat 21 

Sm-2 repeat 8 

Sm-3 repeat 4 

Sm-4 repeat 73 

Sm-5 repeat 47 



were classified by Rfam_scan in 1 8 known ncRNA 
families (Supplementary Table S5). A majority of 
them has thus a completely unknown function. 
Interestingly, analysis of expression patterns indicated 
that a large part of predicted ncRNAs were differen- 
tially expressed (> or < 2-fold, P< 0.01) between at 
least two of the three conditions studied: 1 52 were 
induced in symbiosis compared with free-living condi- 
tions while 1116 were induced, and 317 were 
repressed in stationary phase when compared with 
exponential growth phase (Supplementary Tables S6 
and S7). These expression patterns support the idea 
that ncRNAs potentially play important regulatory 
functions in S. meliloti under these conditions. 

Consistently with the study of Schlutereta/. 56 inter- 
genic repeated elements previously identified in the 
genome of S. meliloti, like the RIME, MOTIF or Sm-1 
to Sm-5 repeats, 59-61 were also transcribed and, 
thus, further increase the number of non-translated 
transcribed elements. Since reads corresponding to 
such repeated sequences could not be unambiguously 
mapped, it was difficult to estimate their relative ex- 
pression levels and to determine whether they were 
all transcribed at a similar level. 



3.5. EuGene-P identifies TSSs and efficiently delineates 
5' UTRs ofmRNAs 
The RNA-Seq protocol used here allowed us to pre- 
cisely predict the 5' ends of RNAs. This is related to the 
fact that, prior to library constructions, RNA molecules 



were treated with the tobacco acid pyrophosphatase 
that converts the 5' triphosphate group of native tran- 
scripts into a 5' monophosphate capable of ligation 
with oligonucleotide adaptors (see Materials and 
methods). This procedure enabled the sequencing of 
5' RNA ends with a very high precision and thereby 
the identification of probable TSSs. TSS prediction 
was based on the identification of abrupt changes in 
expression level as assessed by the approximation of 
the derivative of the expression level logarithm. In 
total, 4077 TSSs of protein-coding genes or nc genes 
were predicted with good confidence (clear changes 
in expression), whereas 763 were predicted with a 
lower confidence. Compared with the existing 
Sm1021 annotation, 31 505 conserved CDSs had a 
modified start codon. This was a direct consequence 
of RNA-Seq data integration since the previously pre- 
dicted start codon was usually located before the 
TSS predicted from RNA-Seq data, showing the inter- 
est of integrating RNA-Seq data for gene annotation. 

To further evaluate EuGene-P predictions, we com- 
pared our data with TSSs experimentally mapped in 
previous studies. Prokaryotic transcription initiates in 
promoter DNA regions, defined by the presence of 
binding sites for a dissociable RNA polymerase 
subunit called sigma factor. 62 To date, seven S. meliloti 
sigma factors (among 1 5) are known to be active in at 
least one of the experimental conditions tested here: 
the vegetative sigma factor (RpoD, or sigma 70) and 
the alternative sigma factors RpoN, RpoH1, RpoH2, 
RpoE2, RpoE1 and RpoE4. 39 > 63 " 66 The TSSs of >1 00 
promoters known or supposed to be controlled by 
either one of these sigma factors were experimentally 
mapped in various studies (Table 4). The TSSs anno- 
tated from the transcript 5' ends mapped in the 
present study are in good agreement with these 
data, as 72% of the experimentally mapped TSSs 
match (+5nt) our annotated TSSs (Table 4 and 
Supplementary Table S8). Several authors used the 
consensus promoter sequences deduced from these 
experimentally determined TSSs, combined or not 
with microarray or Affimetrix data, to predict >200 
additional putative targets of these sigma factors 
(Table 4). The good congruence of these predictions 
with our data (74%) further strengthens our annota- 
tion (Table 4 and Supplementary Table S8). Note, 
however, that the number of correctly annotated 
TSSs was found to be positively correlated with the 
number of reads. Indeed, TSS annotations based on 
small numbers of reads appeared unreliable (24% of 
congruence), whereas TSSs covered by >50 sequen- 
cing reads were found to match more frequently the 
experimentally determined or in silico predicted TSSs 
(82 and 77%, respectively). Caution should therefore 
be taken with weakly expressed genes. Among other 
mis-annotated TSSs are those corresponding to 
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Table 4. Congruence between TSS annotation and the published 
literature 



Fraction of annotated TSS a matching: 



Experimentally In silico 

mapped TSS b predicted TSS b 



RpoD 


22/27 


63/89 


RpoH1 and/or RpoH2 


45/67 


49/69 


RpoE1 and/or RpoE4 


3/4 




RpoE2 


1/1 


29/35 


RpoN 


3/4 


5/6 c 



All mapped or predicted promoter sequences are available in 
Supplementary Table S8. 

a Genes for which no TSS was annotated in the current study 
were not retained for this table. 

b Data extracted from 67 ' 85,86 (RpoD) 65 (RpoH1 /H2), 39 ' 40 - 70 
(RpoE2), 66 (RpoE1 /E4) 87 " 94 (RpoN). 
c As the coordinates of RpoN TSS predicted by Dombrecht 
et al. 94 were not described in their paper, we kept the pro- 
moters carrying the most obvious -24/-12 RpoN- 
binding sequences. 

processed transcripts, such as tRNA and rRNA, for 
which only 4 of 1 4 annotated TSSs match the 
predicted or experimentally determined TSSs 
(Supplementary Table S8). Finally, to evaluate the 
proportion of annotated 5'ends corresponding to 
actual TSSs, we reasoned that most of the promoters 
not analysed above should be recognized by the vege- 
tative sigma factor RpoD. An in silico search revealed 
that >1/3 of them contain putative RpoD-binding 
sequences (Supplementary Table S9), as defined by 
MacLellan et al. 67 Altogether, these observations 
therefore suggest that a large number of the anno- 
tated 5'ends indeed correspond to actual TSSs. 

Interestingly, manual inspection of transcription 
data allowed the identification of 33 CDSs having dif- 
ferent TSSs depending on experimental conditions 
(Supplementary Tables S6 and S7). 

The length of annotated 5' UTRs ranges between 1 
and 839 nt and displays a median size of 45 nt, which 
is similar to the median length of 5' UTRs observed in 
Escherichia coli (37 nt), 68 Synechococcus elongatus 
(33 nt) 2 , Geobacter sulfurreducens (37 nt) 1 or 
Agrobacterium tumefaciens (61 nt). 69 

3.6. Reappraisal of the Sinorhizobium meliloti RpoE2 
regulon 

The genome-wide determination of TSSs should 
make it possible to extend our knowledge of regulons 
by looking for the conserved binding sites of regula- 
tors in promoter regions. We tested this idea on the 
RpoE2 regulon. RpoE2 is an extracytoplasmic function 
sigma factor involved in the general stress response of 
S. meliloti and is activated under various conditions, 



including heat shock, salt stress or entry into station- 
ary phase following nitrogen or carbon starvation. 39 
This sigma factor was found in previous studies to 
target <40 S. meliloti promoters. 39,40 ' 70,71 To re- 
evaluate the extent of the RpoE2 regulon, we screened 
all DNA regions located 5-1 1 nt upstream of 5' tran- 
script ends for the presence of the strictly conserved 
RpoE2-binding sequence (GGAAC N 18 _, 9 TT). 39 We 
identified 1 08 transcription units that meet this 
criterion, including 26 putative ncRNAs 
(Supplementary Table S1 0). That most of these 
sequences correspond to genuine RpoE2-controlled 
promoters was validated by the following observa- 
tions : (i) 30 of them were previously reported as 
RpoE2 targets, 39,70,71 (ii) transcription from 86% of 
the newly identified promoters (67 of 78) was 
found in the current study as being up-regulated 
(> 2-fold, P< 0.001) in stationary phase (a known 
RpoE2-activating condition; Supplementary Table 
S1 0) and finally (iii) using qRT-PCR, we confirmed 
that transcription from 6 of 6 randomly chosen pro- 
moters (four mRNAs and two ncRNAs) is up-regu- 
lated, either following a heat shock or entry in 
stationary phase (two RpoE2-activating conditions), 
in the wild type but not in a rpoE2 mutant strain 
(Supplementary Fig. S3). Altogether, these observa- 
tions further validate TSS annotations predicted by 
EuGene-P and give a demonstration of its power to 
extend the knowledge of a given regulon. 



4. Discussion 

Through RNA sequencing, NGS technologies give 
access to prokaryotic transcriptomes with an unprece- 
dented resolution and provide a massive amount of 
novel information on genome organization. In this 
work, we took advantage of data produced from the 
legume bacterial symbiont S. meliloti to develop a 
new bioinformatic tool that exploits transcription 
data for exhaustive annotation of prokaryotic 
genomes. The oriented RNA-Seq data that were pro- 
duced from Sm2011 in stationary and exponential 
phases as well as in symbiotic condition have excellent 
reproducibility, with highly consistent triplicates and a 
good congruence when compared with previously 
published data. The analysis of both short and long 
fractions of RNAs enabled the identification of tran- 
scribed biological objects of small length, like 
ncRNAs and short CDSs, which could have been lost 
with usual RNA preparation protocols. The Sm2 01 1 
oriented RNA sequencing also showed a complex 
landscape of expression on both strands. Such com- 
plexity would have been completely hidden by non- 
oriented sequencing, possibly leading to biased 
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expression level measurements as well as a poorer 
genome annotation. 

Oriented RNA-Seq data give an opportunity to 
define a new generation of integrative prokaryotic 
genome annotation tools. In the area of prokaryotic 
genome annotation, existing NGS-related studies 11 
have focussed on the possibly increased level of se- 
quencing errors associated with such technologies. 
Here, we showed that the quality of the Sm201 1 
genomic sequence obtained by NGS is comparable 
with, if not better than, the Sm1021 genomic se- 
quence previously generated by Sanger sequencing. 31 
Using oriented RNA-Seq data, the EuGene-P proved to 
be able to automatically produce a complex annota- 
tion with novel coding and nc genes, including many 
antisense genes, untranslated 5' and 3' regions and 
precise mapping of 5' TSSs. To the best of our knowl- 
edge, EuGene-P is the first prokaryotic gene finder 
that is able to predict a comprehensive genome anno- 
tation. The ability to predict highly overlapping func- 
tional regions is directly inherited from the strand- 
specific prediction process, which is itself consistent 
with oriented RNA-Seq data. Predicting genes on 
each strand independently has historically been con- 
sidered as a bad idea given that the gene contents 
of the two DNA strands are highly correlated. 
However, ncRNA genes and specifically antisense 
genes blur this idea, which is already shaken by over- 
lapping CDSs and transcripts. Strand-specific predic- 
tion and oriented RNA-Seq allow dealing with this 
complex situation directly. 

The quality of the Sm201 1 automatic annotation 
was validated by in-depth manual curation. A relative- 
ly limited number of manual modifications were 
made using Apollo for the simultaneous visualization 
of per-triplicated bank expression levels and annota- 
tion on both strands. The distinction between 3' and 
5' UTRs and nearby ncRNA genes remains difficult 
and is still questionable even in the expert annotation. 
Beyond this, the resulting final annotation led to the 
definition of accurate gene structures, which is very 
useful for biologists to better understand the organ- 
ization of genes and to characterize their function 
and regulation. 

The number of predicted ncRNA genes is particular- 
ly high in S. meliloti, even though we cannot rule out 
that some of them encode peptides or small proteins. 
Most predicted ncRNA regions are consistently sup- 
ported by RNA-Seq data. The fact that a large propor- 
tion of predicted ncRNAs are differentially expressed 
between the three physiological conditions analysed 
suggests that they are probably not artefacts intro- 
duced either by cDNA library preparation or by se- 
quencing protocols. Moreover, the list of ncRNA 
genes predicted in a previous RNA-Seq study 56 is 
also largely covered by our predicted nc transcripts, 



despite the fact that it represents only a small fraction 
of the genome. Among the 1 876 predicted ncRNAs, 
29 have been experimentally validated by northern 
blot or 5'-RACE analyses in previous studies. 56,72-74 
Beside tRNAs, rRNAs and the five well-conserved and 
well-characterized ncRNAs, 4.5S RNA (SRP, ffs), 6S 
RNA (ssrS), tmRNA (ssrA), the ribozyme RNase P 
(rnpB) and incA that mediates plasmid incompatibility 
phenotypes, 75 36 ncRNAs belong to known families 
described in the Rfam database, whereas the remain- 
ing predicted ncRNAs could not be assigned to a 
given class. A lot of work thus remains to be done to 
validate the existence of predicted ncRNAs and to 
elucidate their function in S. meliloti. Interestingly, 
454-sequencing of small ncRNAs of A. tumefaciens, a 
bacterium phylogenetically close toS. meliloti, recently 
revealed the presence of numerous small RNAs on all 
four replicons. 69 The number of ncRNAs in S. meliloti 
would be even higher if widespread repeated elements 
like the RIME, MOTIF and Sm-1 to Sm-5 repeats 59 " 61 , 
that appeared to be highly transcribed elements, were 
taken into account. Similar repeated regions, like bac- 
terial interspersed mosaic element and boxC DNA 
repeat elements, have also been shown to be tran- 
scribed in E. coli and to play key roles in transcription 
attenuation 76 or mRNA stabilization. 77,78 More recent- 
ly, they have also been demonstrated to be involved in 
nucleoid morphology and chromosome formation and 
maintenance. 79 

A large proportion of S. meliloti ncRNAs were found 
to map antisense to annotated protein-coding genes. 
With oriented RNA-Seq data, antisense transcription 
now appears to be a common and widespread phe- 
nomenon in bacteria as recently reported for £. coli, 
in which 1005 antisense RNAs were identified, 80,81 
and Helicobacter pylori, in which 46% of CDSs are 
overlapping with at least one antisense RNA. 82,83 
Several mechanisms of the action of antisense RNAs 
in bacteria have been recently reviewed. 84 They 
include the alteration of target RNA stability, the 
modulation (inhibition or activation) of translation, 
transcriptional interference and attenuation. 
Antisense RNA-mediated regulation thus likely 
appears as an important component of complex regu- 
latory pathways controlling gene expression in bac- 
teria. However, it was recently suggested by Nicolas 
et al. 83 that some antisense RNAs can potentially 
arise from spurious transcription initiation or from 
imperfect control of transcription termination. 

In this study, we also provided a detailed map of S. 
meliloti TSSs. This high-resolution TSS map is in agree- 
ment with previous in silico predicted or experimen- 
tally determined TSSs, in which 72% of validated 
TSSs matched our annotated TSSs by + 5 nt. These 
data will greatly facilitate the study of promoter 
regions, the identification of protein-binding motifs 
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and the determination of regulons in S. meliloti. This 
was done here for the RpoE2 regulon that appears 
to be almost three times larger than previously deter- 
mined using classical approaches. 39 

Oriented bacterial RNA-Seq data also unveil more 
complex mechanisms, such as alternative transcrip- 
tion starts, depending on the experimental condition 
(exponential or stationary phase). In our expression 
data, we identified 33 genes that displayed multiple 
TSSs. The frequency of multiple TSSs would have prob- 
ably been higher if more physiological conditions had 
been analysed. Indeed, it was shown in E. coli and 
Bacillus subtilis that 35 and 46% of genes, respectively, 
have multiple TSSs. 68,83 This type of adaptive behav- 
iour is currently difficult to represent and raises new 
problems for automatic genome annotation and 
visualization. 

In conclusion, we developed a new generic tool, 
EuGene-P, to automatically and accurately annotate 
prokaryotic genomes by integrating genome-wide ex- 
perimental data, such as RNA-Seq data. This tool was 
used to re-visit the structural annotation of S. meliloti, 
providing a much more complete and comprehensive 
view of its genome architecture. The ability of 
EuGene-P to identify nc transcribed elements as well 
as to precisely map TSSs offers a new view of prokaryot- 
ic genomes and should greatly contribute to our under- 
standing of gene regulation and function in bacteria. 
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